web : sbc.shef.ac.uk
twitter: [@SheffBioinfCore](https://twitter.com/SheffBioinfCore)
email: bioinformatics-core@sheffield.ac.uk
This tutorial will cover the basics of RNA-seq using Galaxy and Degust; two open-source web-based platforms for the analysis of biological data. You should gain an appreciation of the tasks involved in a typical RNA-seq analysis and be comfortable with the outputs generated by the Bioinformatician.
Sections with this background indicate exercises to be completed during the workshop.
Sections with this background highlight particular shortcuts or other references that might be useful.
Sections with this background give information about potential error messages or might encounter, or problems that might arise in your own data analysis.
The official Galaxy page has many tutorials on using the service, and examples of other types of analysis that can be performed on the platform.
Those eventually wanted to perform their own RNA-seq analysis (for example in R), should look out for other courses
Two workflows are possible with RNA-seq data - with the difference being whether one performs an alignment to the reference genome or not.
Recent tools for RNA-seq analysis (e.g. salmon, kallisto) do not require the time-consuming step of whole-genome alignment to be performed, and can therefore produce gene-level counts in a much faster time frame. They not require the creation of large bam files, which is useful if constrained by file space on Galaxy.
(image from Harvard Bioinformatics Core)
The data for this tutorial comes from a Journal of Experimental Medicine paper “Itraconazole targets cell cycle heterogeneity in colorectal cancer”. This study examines the expression profiles of two cell lines in response to treatment with itraconazole.
For this tutorial, we will assume that the wet-lab stages of the experiment have been performed and we are now in the right-hand branch of the workflow. In this tutorial we will demonstrate the steps of Quality assessment, alignment, quantification and differential expression testing.
The fastq data for this experiment were made available on the Sequencing Read Archive with accession SRP144496. For the purposes of this workshop we have created a downsampled dataset
Make sure you check your email to activate your account
Now, use this link to access a special queue. This link will only be active on December 6th.
The data for this course have all been shared on a google drive. If you have not done so already, please download this directory as a zip file
https://drive.google.com/open?id=1ftuBP5L-rcXwsEub2mIaSDJ1tceHYFc7
We can going to import the fastq files for this experiment. This is a standard format for storing raw sequencing reads and their associated quality scores. To make the practical quicker, we have downsampled the original fastq files to a quarter of a million reads.
The experimental design for the dataset is summarised in the table below.
| run | name | cell_line | condition |
|---|---|---|---|
| SRR7108388 | HT55_CONT_1 | HT55 | DMSO |
| SRR7108389 | HT55_CONT_2 | HT55 | DMSO |
| SRR7108390 | HT55_CONT_3 | HT55 | DMSO |
| SRR7108391 | HT55_CONT_4 | HT55 | DMSO |
| SRR7108392 | HT55_ITRA_1 | HT55 | ITRACONAZOLE |
| SRR7108393 | HT55_ITRA_2 | HT55 | ITRACONAZOLE |
| SRR7108394 | HT55_ITRA_3 | HT55 | ITRACONAZOLE |
| SRR7108395 | HT55_ITRA_4 | HT55 | ITRACONAZOLE |
| SRR7108396 | SW948_CONT_1 | SW948 | DMSO |
| SRR7108397 | SW948_CONT_2 | SW948 | DMSO |
| SRR7108398 | SW948_CONT_3 | SW948 | DMSO |
| SRR7108399 | SW948_CONT_4 | SW948 | DMSO |
| SRR7108400 | SW948_ITRA_1 | SW948 | ITRACONAZOLE |
| SRR7108401 | SW948_ITRA_2 | SW948 | ITRACONAZOLE |
| SRR7108402 | SW948_ITRA_3 | SW948 | ITRACONAZOLE |
| SRR7108403 | SW948_ITRA_4 | SW948 | ITRACONAZOLE |
You can import the data by:
fastq directory of the zip file that you downloaded from google drive and select these two files from the HT55-DMSO condition.SRR7108388.fastq.gz SRR7108389.fastq.gz
and these two files are from the HT55 ITRACONAZOLE condition.
SRR7108392.fastq.gz SRR7108393.fastq.gz
also upload the file Homo_sapiens.GRCh38.cdna.all.fa.gz, which is a reference file that we will use later.
SRR7108388.fastq.gzSRR7108389.fastq.gzSRR7108392.fastq.gzSRR7108393.fastq.gzThe .gz at the end of each file name means that it is compressed (like a zip file).
You can upload the other files for extra practice if you wish
FastQC is a popular tool from Babraham Institute Bioinformatics Group used for quality assessment of sequencing data. Most Bioinformatics pipelines will use FastQC, or similar tools in the first stage of the analysis. The documentation for FastQC will help you to interpret the plots and stats produced by the tool. A traffic light system is used to alert the user’s attention to possible issues. However, it is worth bearing in mind that the tool is blind to the particular type of sequencing you are performing (i.e. whole-genome, ChIP-seq, RNA-seq), so some warnings might be expected due to the nature of your experiment.
Question: Do the data seem to be of reasonable quality?
You can use the documentation to help interpret the plots
If poor quality reads towards the ends of reads are considered to be a problem, or there is considerable adapter contamination, we can employ various tools to trim our data.
However, a recent paper demonstrated that read trimming is no longer required prior to alignment:- https://www.biorxiv.org/content/10.1101/833962v1
If you also suspect contamination by another organism, or rRNA present in your data, you can use the sortMeRNA tool to remove this artefact.
It can be quite tiresome to click through multiple QC reports and compare the results for different samples. It is useful to have all the QC plots on the same page so that we can more easily spot trends in the data.
The multiqc tool has been designed for the tasks of aggregating qc reports and combining into a single report that is easy to digest.
FASTQ Quality Control -> Multiqc
Under Which tool was used generate logs? Choose fastqc and select the RawData output from the fastqc run on each of your bam files.
Question: Repeat the FastQC analysis for the remaining fastq files and combine the reports with multiQC. Do the fastq files seem to have consistently high-quality?
We can align our RNA-seq reads to a reference genome, and then overlap with know gene coordinates, but many prefer to align directly to the transcriptome sequences using a method such as salmon or kallisto. We will demonstrate the salmon protocol.
This file has been provided in the google drive folder
However, it is useful to know where this file came from in case you are not working with Human data. The file was obtained from Ensembl by clicking on the cDNA (FASTA) link for the appropriate organism (Human).
On the next screen, Right-click to save the .cdna.all.fa.gz to your computer
By default, salmon will produce counts for each transcript. This might be what we want, but for most standard analyses it is preferable to work at the gene-level. We therefore have to tell salmon how the transcripts in the cDNA file relate to known genes. Such a file can be obtained from biomart.
It is important that the file downloaded from Biomart is edited so that the column headings do not contain any spaces. You can do this in a text editor. The edited file should look like this.
It is important to make sure the version number of your transcript file and the biomaRt dataset are the same, otherwise some of the steps downstream might not work as expected.
If you have problems, this mapping file is also provided in the google drive as tx2gene.txt.
The Ensembl gene IDs are not particularly memorable, so it would be highly beneficial to have other annotations at hand to help us interpret the data downstream. We can use the biomart website again to produce a table to downstream intrepretation.
This time, select only the Gene Stable ID tickbox in the GENE box. Expand the EXTERNAL panel by clicking the “+” next to EXTERNAL, and select HGNC symbol and NCBI gene (formerly Entrezgene) ID
RNA Analysis -> Salmon quant
Two jobs will now be queued for each sample fastq file. The Quantification output will contain transcript-level data, and the Gene Quantification output will be at the gene-level. We should expect the number of lines in the Gene Quantification file to be substantially less. If not, you will need to check that your transcript mapping file was correct.
The Gene Quantification output from each sample comprises the following columns:-
Note that we are using a downsampled dataset, so the majority of NumReads will be zero.
Methods for detecting differential expression are likely to want data in the form of a table; where every row is a different gene and each column is a unique biological sample. Before we can proceed we will therefore need to merge our salmon results into a single output. This can be down using the Salmon quantmerge tool
RNA Analysis -> Salmon quantmerge
Use the +Insert Quant file and names button repeatedly to select each of your Gene Quantification outputs. The One-word sample names text box can be used to create a shorter column name for each output.
Once all the Gene Quantification files have been selected the drop-down menu under Columns should be changed from Length to NumReads.
After the tool has finished you should have a table with
Text Manipulation -> Join two files
If time allows, we will also follow this section
Mapping -> HISAT2
In the left tool panel menu, under NGS Analysis, select Mapping > HISAT2 and set the parameters as follows:
SRR7108388.fastq.gzSRR7108389.fastq.gzSRR7108392.fastq.gzSRR7108393.fastq.gzIn order to test for differential expression, we need to count up how many times each “feature” is observed in each sample. The goal of such operations is to produce a count table such as that shown below. We can then apply statistical tests to these data
HTSeq-count creates a count matrix using the number of the reads from each bam file that map to the genomic features. For each feature (a gene for example) a count matrix shows how many reads were mapped to this feature.
Various rules can be used to assign counts to features
To obtain the coordinates of each gene, we can use the UCSC genome browser which is integrated into Galaxy.
RNA Analysis > htseq-count
The htseq tool is designed to produce a separate table of counts for each sample. This is not particularly useful for other tools such as Degust (see next section) which require the counts to be presented in a data matrix where each row is a gene and each column is a particular sample in the dataset.
Collection Operations -> Column Join on Collections
ht-seq count files from your history SRR1552444.htseq, SRR1552450, etc… Holding the CTRL key allows multiple files to be selected1Differential expression is possible using Galaxy using the DESeq2 tool (for example). However, our particular recommendation is to use Degust for a more interactive experience. For this section, we will be using counts generated on the full dataset, rather than the downsampled data analysed in the previous section. These counts are available in the file GSE114013_salmon_counts.csv.
The term differential expression was first used to refer to the process of finding statistically significant genes from a microarray gene expression study.
Such methods were developed on the premise that microarray expression values are approximately normally-distributed when appropriately transformed (e.g. by using a log\(_2\) transformation) so that a modified version of the standard t-test can be used. The same test is applied to each gene under investigation yielding a test statistic, fold-change and p-value. Similar methods have been adapted to RNA-seq data to account for the fact that the data are count-based and do not follow a normal distribution.
Degust is a web tool that can analyse the counts files produced in the step above, to test for differential gene expression. It offers and interactive view of the differential expression results
The input file is a count matrix where each row is a measured gene, and each column is a different biological sample. Within the tool we can configure which samples belong to the different biological groups of interest.
Read counts have to be normalised first prior to differential expression testing. There are two main biases that need to be accounted for:-
However, R-based methods such as edgeR (implemented in Degust) and DESeq2 have their own method of normalising counts. You will probably encounter other methods of normalising RNA-seq reads such as RPKM, CPM, TPM etc. This blog provides a nice explanation of the current thinking. As part of the Degust output, you have the option of downloading normalised counts in various formats. Some other online visualisation tools require normalised counts as input, so it is good to have these to-hand.
GSE114013_salmon_counts.csv, and click Open.| run | name | cell_line | condition |
|---|---|---|---|
| SRR7108388 | HT55_CONT_1 | HT55 | DMSO |
| SRR7108389 | HT55_CONT_2 | HT55 | DMSO |
| SRR7108390 | HT55_CONT_3 | HT55 | DMSO |
| SRR7108391 | HT55_CONT_4 | HT55 | DMSO |
| SRR7108392 | HT55_ITRA_1 | HT55 | ITRACONAZOLE |
| SRR7108393 | HT55_ITRA_2 | HT55 | ITRACONAZOLE |
| SRR7108394 | HT55_ITRA_3 | HT55 | ITRACONAZOLE |
| SRR7108395 | HT55_ITRA_4 | HT55 | ITRACONAZOLE |
| SRR7108396 | SW948_CONT_1 | SW948 | DMSO |
| SRR7108397 | SW948_CONT_2 | SW948 | DMSO |
| SRR7108398 | SW948_CONT_3 | SW948 | DMSO |
| SRR7108399 | SW948_CONT_4 | SW948 | DMSO |
| SRR7108400 | SW948_ITRA_1 | SW948 | ITRACONAZOLE |
| SRR7108401 | SW948_ITRA_2 | SW948 | ITRACONAZOLE |
| SRR7108402 | SW948_ITRA_3 | SW948 | ITRACONAZOLE |
| SRR7108403 | SW948_ITRA_4 | SW948 | ITRACONAZOLE |
(Not that the screenshots are for illustration purposes and taken from a different dataset to that being analysed in the tutorial)
Each dot shows the change in expression in one gene.
Click on the dot to see the gene name.
Each line shows the change in expression in one gene, between control and treatment.
Table of genes
The table can be sorted according to any of the columns (e.g. fold-change or p-value)
Above the genes table is the option to download the results of the current analysis to a csv file. You can also download the R code required to reproduce the analysis by clicking the Show R code box underneath the Options box.
Plots such as the MDS, MA and heatmap can also be exported by right-clicking on the plot.
This is a multidimensional scaling plot which represents the variation between samples. It is a similar concept to a Principal Components Analysis (PCA) plot. The x-axis is the dimension with the highest magnitude. In a standard control/treatment setup, samples should be split along this axis. A desirable plot is shown below:-
Question: It seems that the differential expression analysis is detecting lots of genes. However, does this tell the whole story about the dataset? What do you think is the main factor separating samples on the x-axis, and thus explaining the most variation in the data?
We will now repeat the analysis, but only for samples from the HT55 cell-line. The correct configuration for this analysis is shown below:-
Exercise: How many genes are differentially-expressed with an FDR < 0.05 and abs logFC > 1. Download this file and rename it to HT55.ITRACONAZOLE_vs_DMSO.csv.
Exercise: Repeat the analysis for SW948 samples and download the table of differentially-expressed results (same FDR and log fold-change) to SW948.ITRACONAZOLE_vs_DMSO.csv
Exercise: Rest the FDR cut-off and abs LogFC cutoffs back to default and download the file and rename to background.csv. We will use this later.
If you didn’t manage to complete these analyses, you can download the files from here by right-clicking on each link and selecting “Save Link as” (or equivalent). They are also available in the course google drive.
We might sometimes want to compare the lists of genes that we identify using different methods, or genes identified from more than one contrast. In our example dataset we can compare the genes in the contrast of ITRACONAZOLE vs DMSO in HT55 and SW948 cells
The website venny provides a really nice interface for doing this.
The final analysis we will perform is to include all the samples, but correct for the differences in cell-line. This is achieved by telling Degust about the hidden factors in our dataset. The hidden factor in this dataset is whether the sample is from the HT55 or SW948 samples. However, we only need to specify which samples are from HT55. Other hidden factors you might need to include could be (depending on the MDS plot) :-
See below for the correct configuration to include the hidden factors.
You are now ready to complete the final section on annotation and enrichment analysis